Evaluating Model Robustness to Image Quality Degradation in Histological Classification

Image 14

Published

June 5, 2027

Code
library(readr)
library(dplyr)
library(ggplot2)
library(purrr)
library(stringr)
library(cowplot)

metrics <- read_csv("../metrics/combined_report_metrics.csv")
Code
blur_plot_data <- metrics %>%
  select(test_set, blur_size, noise_level, accuracy, Model_Label) %>%
  filter(Model_Label %in% c("RF (PCA)", "XGBoost (PCA)", "CNN", "ResNet")) %>%
  group_by(blur_size, noise_level, Model_Label) %>%
  summarise(accuracy = mean(accuracy), .groups = "drop") %>%
  mutate(
    noise_level = as.factor(noise_level),
    Model_Label  = factor(Model_Label)
  )

print(ggplot(blur_plot_data, aes(x = blur_size, y = accuracy, color = noise_level, group = noise_level)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~ Model_Label, ncol = 2) +
  labs(
    x = "Blur Radius",
    y = "Accuracy",
    color = "Noise Level"
  ) +
    scale_color_brewer(palette = "Blues") +
  theme_minimal(base_size = 13) +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    legend.position = "right"
  ))
Figure 1: Accuracy vs Blur Radius by Noise Level, faceted by Model.
Code
# need to fix caption here 

# parse confusion matrices
parse_cm <- function(cm_str) {
  nums <- as.numeric(unlist(str_extract_all(cm_str, "\\d+")))
  matrix(nums, nrow = 4, byrow = TRUE)
}

# average across test sets
get_avg_cm <- function(df, model_name, blur_val, noise_val) {
  df %>%
    filter(Model_Label == model_name, blur_size == blur_val, noise_level == noise_val) %>%
    pull(confusion_matrix) %>%
    map(parse_cm) %>%
    reduce(`+`) %>%
    `/`(3)  # divide by 3 test sets
}

plot_cm <- function(cm, title) {
  df <- as.data.frame(as.table(cm))
  colnames(df) <- c("True", "Predicted", "Freq")
  df$True      <- factor(as.integer(df$True),
                         levels=1:4,
                         labels=c("Immune","Other","Stromal","Tumour"))
  df$Predicted <- factor(as.integer(df$Predicted),
                         levels=1:4,
                         labels=c("Immune","Other","Stromal","Tumour"))
  ggplot(df, aes(x=Predicted, y=True, fill=Freq)) +
    geom_tile(color="white") +
    geom_text(aes(label=round(Freq, 1)), size = 4) +
    scale_fill_gradient(low="white", high="dodgerblue") +
    coord_fixed() +
    labs(title=title) +
        theme_minimal(base_size = 12) +
    theme(
      axis.title       = element_blank(),
      legend.position  = "none",
      plot.title       = element_text(hjust=0.5,
                                      size=12,
                                      face="bold",
                                      margin=margin(b=5)),
      plot.margin      = margin(t=5, r=5, b=5, l=5),

      #nlarge & rotate x‐labels so they don't collide
      axis.text.x      = element_text(
                           size = 10,
                           angle = 45,
                           hjust = 1,
                           vjust = 1,
                           margin = margin(t = 5)
                         ),
      #give y‐labels a bit more breathing room
      axis.text.y      = element_text(
                           size = 10,
                           margin = margin(r = 5)
                         )
    )
}

# Generate plots for all models and both augmentation levels
p1 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 0, 0), "XGBoost – No Augmentation")
p2 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 19, 30), "XGBoost – Max Augmentation")

p3 <- plot_cm(get_avg_cm(metrics, "ResNet", 0, 0), "ResNet – No Augmentation")
p4 <- plot_cm(get_avg_cm(metrics, "ResNet", 19, 30), "ResNet – Max Augmentation")

combined1 <- plot_grid(
  p1, p3, p2, p4,
  nrow = 2,
  align = "hv",
  axis  = "tblr"
)

# axis labels
labeled1 <- add_sub(combined1, "Predicted Class", vpadding = grid::unit(1, "lines"))
labeled1 <- ggdraw(labeled1) +
  draw_label("True Class", angle = 90, x = 0, y = 0.5, vjust = 1.5)

labeled1
Figure 2: True and Predicted label predictions averaged across test sets.

1 Executive Summary

Diagram illustrating the experimental workflow and deployment pipeline. Each step is annotated with the corresponding Python script used, enabling full reproducibility of the process.

2 Background

The medical field is increasingly turning to Computer Vision models for classification tasks, often distinguishing cell types. With rapid increases in the field, models have come to accuracies of up to 98.5%1, slightly above that of an experienced human pathologist2. The question becomes, can these models maintain their performance in day-to-day practice? In real life, cell images do not always have perfect quality. Investigating how drops in image quality impact classification performance will give medical imagers a better view of how these models will perform on real data.

To address this issue, it is first necessary to determine the common causes and kinds of image degradation. These vary greatly depending on the situation, for example, motion blur is a typical challenge for MRIs3. This report, however, centers on histological H&E stained tissue slides, specifically of breast cancer tissue4. The major cause of quality issues in this case is ‘Whole Slide Imaging’ or WSI causing blur and noise.5

WSI is the common practice of scanning full microscope slides, rather than scanning section by section.5 It has reduced the processing time of a single slide to mere minutes, but there are trade-offs in image quality. Focal points are the points where the camera centers its focus, they are selected automatically or manually. As microscope slides are three-dimensional, if a focus point is selected on a region with different depth to the typical focus depth of the slide, its neighbouring areas will be out of focus. This will lead to blur issues, and more prominent noise due to the camera’s failure to fully capture high frequency information like texture and edges.6

There are ways to mitigate image quality drops: increasing the number of focal points will reduce the prominence of the issue, but not eradicate it entirely. Further, this will also increase processing time, inconveniencing patients and adding delays on overloaded labs.5 Imaging technology is advancing quickly, but with their prohibitive expense, the issue is likely to persist.

3 Method

Data was sourced originally from the Gene Expression Omnibus, a source of publicly available biomedical data. The images were taken with a Xenium Analyser. This gives spatial gene information as well as high resolution imaging4, so will likely have higher image quality than a typical testing set taken with WSI. The images were of whole slides, but were segmented and cropped into individual cell types and labelled accordingly7. While images of magnification levels 50 and 100 were available, only 100 level images were selected due to their higher historical performance8 and computational constraints. Analysis was performed on these labelled images.

How did you design the classification task? (CELL GROUPING)

To sample the data, first images were combined in their respective classes: tumour, stromal, immune or other. Each class was then shuffled to obtain a random sample of each class. To balance model performance and computational constraints, 20,000 images were taken with 75% for training, 10% for validation and 15% for testing. All categories had equal representation across all samples. These datasets were identical for all models.

Images were read in as pixels, and resized to 224x224 pixels with Lanczos downsampling9 to ensure that augmentation effects were standard among all images.

To assess the impact of reduced image quality on classification performance, two classical machine learning models and two deep learning models were trained to compare their robustness and find if one had an advantage over the other.

For the deep learning models, a custom Convolutional Neural Network (CNN)10 and a ResNet50 model pretrained on the ImageNet dataset11 were selected. This gave one shallower model trained expressly for the given classification task, and one deeper model trained on 14 million diverse images, to test a spectrum of deep learning architectures. Training and testing images for these models were normalised with ImageNet normalisation parameters, required by the ResNet5012 and applied to the CNN for comparability of results.

For machine learning models, an XGBoost13 and a Random Forest14 model were trained to compare a traditional method with a more modern one. Three methods of model input were tested: raw pixels, Histogram of Gradients (HOG) and Principal Component Analysis (PCA)15.

PCA gave the best performance while minimising space, converting 50,000 pixels to 100 components and accounting for ~60% of variance in the dataset. Images were flattened, then the principal components (PC) were fit to the training dataset. Testing images were linearly transformed with the pre-fit PC, not refit to the testing data. Refitting to the testing data would risk data leakage, or otherwise produce a non-compatible linear transformation and thus would give poor performance. Although out of the scope of this study, each PC can be visualised as an image, and transformed images can give clarity into what the model sees and decides by. Details can be found in the appendix [15].

{FIG appendix: pca plot, images of top 10 pcs, pca reconstructed image on 0 augmentation vs full augmentation}

All models underwent extensive hyperparameter optimisation, including nested cross-validation, and early-stopping for deep learning models using the validation set.

Models were tested using the same testing set at different augmentation levels and combinations of Gaussian blur and noise to simulate WSI damage6. Initially both were tested at small increments, which were increased once a pattern emerged to manage computation, ending with blur radius levels of 0, 1, 3, 5, 7, 9 and 19 pixels and noise standard deviations of 0, 1, 3, 5, 10, 20 and 30.

To evaluate overall performance, metrics of accuracy, confusion matrices, average maximum confidence and weighted F1, precision, recall scores were taken. For a per class breakdown, metrics of precision, recall, f1, average confidence and standard deviation and prediction count were taken.

4 Results

From [figure], it is clear that deep learning models perform best with no augmentations, but performance will drop near instantly once augmentations are applied. Machine learning models, on the other hand, remain relatively robust across most augmentations. These patterns were replicated in F1 and weighted recall scores, and mostly in precision.

4.0.1 Deep Learning

Noise causes the sharpest drop in accuracy for the CNN and ResNet50 models, with ResNet50 dropping to 30% after noise reaches a standard deviation of 1. The CNN has a slower impact, having negligible impact from noise levels up to 3, then dropping steeply. Both models are highly effective in extracting spatial features from data - relating the information of several neighbouring pixels rather than individual values. Due to these close pixel relationships in the CNNs, it makes sense that slight changes to each pixel’s colour - and subsequently all surrounding pixels - would have a significant impact on the model’s performance. Other research has correlated these findings, such as16 who showed that using denoising software NoisyEnsembles improved their CNN performance.

Similarly to noise, blur caused a quick drop-off in the CNN models’ performances. While not as pronounced nor severe as with noise, there was still a marked decrease across all metrics after blur increased in kernel size beyond 9 pixels. This result is not unsurprising as even to the human eye, this level of blurring appears to be significant. This is consistent with the performance seen above with Gaussian noise applied, however to a less extreme extent as neighbouring pixels are not being altered using the same normally random distribution, rather as a function of their neighbouring colour channels. As a result the impact of blurring is less intense at low levels than noise, with only minor performance decreases.

Several other studies have similar findings, showing that models trained using sharp images struggle to generalise with images that have been blurred.17 interestingly tested the effects of training CNNs using blurred images originally, then sharp images later and found that these models consistently performed better than those that were trained using only sharp images (as we have done in our investigation).

4.0.2 Machine Learning

On the other hand, although starting at lower accuracies the machine learning models are more robust to drops in quality. Both models have limited drops with a blur kernel size of up to 5 and hardly any impact from noise levels of up to 30.

The robustness of these models is most likely due to the Principal Component dimension reduction. This makes intuitive sense once visualised in [fig 3], it is clear that once the principal component transformation has been applied. The PCA transformed images appear blurred, and effectively lose their noise. As the Principal Components (PCs) are global - capturing the whole image, noise will simply be discarded. Comparatively, a CNN based model will examine local patterns which can be distorted by localised noise. Notably, PCA is often used as a denoising technique18, which explains its robustness to noise in classification tasks.

The machine learning models are slightly less robust to blur. As blur increases, the linear transformation can detect less information. However, even at extreme levels where it is uninterpretable to the human eye, the model can still extract colour, spatial and intensity information. Again, due to its global nature it is able to capture larger scale features, so it can continue extracting relevant patterns even with limited local information.

Therefore in the case of image quality concerns, the PCA transformed XGBoost model will give the best result, otherwise ResNet50 will outperform all other models when image quality is high.

{FIG 4: CONFUSION MATRICES FOR 0 AUG TO FULL AUG} Why do deep learning models default to immune and stromal? XGBoost ends up leaning towards tumour and immune RF will rarely predict ‘other’, goes toward tumour and stromal and stops predicting immune.

5 Application Development

The application is designed to complement the real-world workflows of medical imaging professionals and diagnosticians, where visual assessment is central. By pairing augmented images with model performance metrics, it bridges the gap between human judgement and machine classification, supporting interdisciplinary decision-making. The backend integrates two key pipelines: a pre-computed (but dynamically updatable) metrics pipeline, and pre-trained models that allow new predictions from user-uploaded images.

Page 1 displays example cell images from each class under user-selected blur and noise levels, alongside graphical performance summaries to allow quick model comparison under the augmentations applied.

Page 2 provides per-model class-level performance, including confidence and confusion matrices, supporting evaluation of model reliability and cell-type-specific performance as required.

Page 3 allows users to upload an image and view predictions across all augmentation combinations, helping assess model behaviour under novel, real-world image conditions.

6 Discussion

Some limitations of the experiment are that it is specific to blurring and noise type image degradation in the context of breast tissue slides. Different tasks may have different image quality issues, or image characteristics that may alter model performance. Further, we applied blur to the entire image, where in practice it may be more localised, or have varying levels of intensity throughout the image. Further work to mitigate these issues would include exploring quality issues in different contexts and experimenting on those specific domains, or using a randomised blur approach in this context.

Regarding model development practice, the training set was relatively low due to computational constraints, so increasing the training size may give better or potentially more robust results for deep learning models. Additionally, it would be interesting to examine whether using reconstructed PCA transformed images to train and test deep learning models could balance their high accuracy and robustness to augmentation.

7 Conclusion

[TODO]

References

1.
2.
3.
Luo, S. et al. Motion blur detection in radiographs. SPIE 6914, 100440 (2008).
4.
5.
6.
Shakhawat, N. et al. Automatic quality evaluation of whole slide images for the practical use of whole slide imaging scanner. ITE Transactions on Media Technology and Applications 8, 252–268 (2020).
7.
Ghazanfar, S. Preprocessing code. (2025).
8.
9.
Duchon, C. E. Lanczos filtering in one and two dimensions. Journal of Applied Meteorology and Climatology 18, 1016–1022 (1979).
10.
O’Shea, K. & Nash, R. An introduction to convolutional neural networks. (2015).
11.
Deng, J. et al. Imagenet: A large-scale hierarchical image database. in 2009 IEEE conference on computer vision and pattern recognition 248–255 (Ieee, 2009).
12.
13.
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 785–794 (ACM, 2016). doi:10.1145/2939672.2939785.
14.
Brieman, L. Random forests. Machine Learning 45, 5–32 (2001).
15.
Mudrova, M. & Prochazka, A. PRINCIPAL COMPONENT ANALYSIS IN IMAGE PROCESSING.
16.
17.
18.
Bakir, et al., Weston. Learning to find pre-images. in Advances in neural information processing systems 16 449–456 (Biologische Kybernetik; Max-Planck-Gesellschaft; MIT Press, 2004).
19.